High-Value Token-Blocking: Efficient Blocking Method for Record Linkage
Abstract
Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as a part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for unsupervised and schema-agnostic blocking, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods over datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss the results in terms of accuracy, resources, and the different characteristics of the datasets. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods should be considered before resorting to more sophisticated methods.
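The abstract does not spell out how HVTB selects tokens, so the following is only a minimal sketch of the general idea of token blocking with TF-IDF, not the authors' algorithm. It assumes whitespace tokenization, a plain `tf * log(N / df)` weighting, and a hypothetical `top_k` parameter that keeps each record's highest-scoring tokens as block keys; the function names are illustrative.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def high_value_blocks(records, top_k=2):
    """Group record ids under each record's top_k highest TF-IDF tokens.

    `records` maps a record id to its text. `top_k` is an assumed parameter
    for this sketch, not a value taken from the paper.
    """
    tokens = {rid: tokenize(text) for rid, text in records.items()}
    n = len(records)
    # Document frequency: in how many records each token occurs.
    df = Counter(tok for toks in tokens.values() for tok in set(toks))
    blocks = defaultdict(set)
    for rid, toks in tokens.items():
        tf = Counter(toks)
        scores = {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
        # Highest score first; ties are broken alphabetically for determinism.
        best = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        for t in best:
            blocks[t].add(rid)
    return blocks


def candidate_pairs(blocks):
    """Return the record pairs that share at least one block key."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only records that share a high-scoring token are compared, so a token that occurs in every record (TF-IDF of zero) tends not to create a block. Schema-agnostic here means the whole record text is tokenized without reference to field names.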
Similar resources
Learning Blocking Schemes for Record Linkage
Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, ...
Secure Blocking + Secure Matching = Secure Record Linkage
Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we uti...
Towards Parameter-free Blocking for Scalable Record Linkage
Linking or matching databases is becoming increasingly important in many data mining projects, as linked data can contain information that is not available otherwise, or that would be too expensive to collect. A main challenge when linking large databases is the complexity of the linkage process: potentially each record in one database has to be compared with all records in the other database. ...
A Comparison of Blocking Methods for Record Linkage
Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consider two variants of a method known as locality s...
Leveraging Unlabeled Data to Scale Blocking for Record Linkage
Record linkage is the process of matching records between two (or multiple) data sets that represent the same real-world entity. An exhaustive record linkage process involves computing the similarities between all pairs of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and only comparing records within the...
Journal
Journal title: ACM Transactions on Knowledge Discovery From Data
Year: 2021
ISSN: 1556-472X, 1556-4681
DOI: https://doi.org/10.1145/3450527